CCLE RNA-Seq Data Visualization

Fan Wang

June 10th 2022

The Cancer Cell Line Encyclopedia (CCLE) project is an effort to conduct a detailed genetic characterization of a large panel of human cancer cell lines. The CCLE provides public access to analysis and visualization of DNA copy number, mRNA expression, mutation data, and more for over 1000 cancer cell lines. This notebook demonstrates how to visualize gene expression for cell lines of interest. The CCLE gene expression data was pulled from BRH, formatted using Pandas, and visualized using Seaborn.

Accessing the CCLE Data

In [7]:
import warnings
import zipfile

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Use the "ticks" theme and render inline figures as SVG.
sns.set(style="ticks", color_codes=True)
get_ipython().run_line_magic("config", "InlineBackend.figure_format = 'svg'")

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")
In [2]:
!pip install gen3 -U
!gen3 drs-pull object dg.OADC/41c3f1ac-2cc7-4b04-b09c-a9c5dbad2c98 --no-unpack-packages
Successfully installed drsclient-0.2.1 gen3-4.10.1
In [5]:
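# Archive pulled by the `gen3 drs-pull` command above.
path_to_zip_file_expression = "CCLE_data_22Q2.zip"

# Extract the archive into the current working directory.
with zipfile.ZipFile(path_to_zip_file_expression, "r") as zip_ref:
    zip_ref.extractall()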
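
Before extracting, it can help to check what an archive contains. The following is a minimal sketch, not part of the original notebook: it builds a small in-memory archive with made-up member names (`demo/sample_info.csv`, `demo/expression.csv`) so that it runs without the CCLE download, then lists the members with `ZipFile.namelist()`.

```python
import io
import zipfile

# Build a tiny in-memory archive so the example runs without the CCLE download.
# The member names and contents are hypothetical stand-ins for the real files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo/sample_info.csv", "DepMap_ID,lineage\nID1,lung\n")
    archive.writestr("demo/expression.csv", ",GENE1 (1)\nID1,2.5\n")

# Reopen the archive for reading and list its members before extracting anything.
with zipfile.ZipFile(buffer, "r") as archive:
    members = archive.namelist()
    csv_members = [name for name in members if name.endswith(".csv")]

print(csv_members)
```

For the real download, the same `namelist()` call on `path_to_zip_file_expression` would show which CSV files to expect after extraction.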

Extraction of top 2000 genes with the largest variation

In [6]:
# Read sample_info and select only cell lines with lineage 'lung'.
sample_info = pd.read_csv("CCLE_data_22Q2/sample_info_22Q2.csv")
lung = sample_info[sample_info["lineage"] == "lung"]

# Read the expression data and keep only the first word of each column name.
# The unnamed first column (read as "Unnamed: 0") is renamed to "Cell Line".
expression = pd.read_csv("CCLE_data_22Q2/CCLE_expression_22Q2.csv", sep=",")
expression.columns = [line.split(" ")[0] for line in expression.columns.to_list()]
expression.rename(columns={"Unnamed:": "Cell Line"}, inplace=True)

# Merge expression data of cells of lung lineage into a data frame called lungexpressAll.
lungexpressAll = pd.merge(
    lung, expression, left_on="DepMap_ID", right_on="Cell Line", how="inner"
)

# Drop every sample_info column except the cell line name, plus the merge key.
removecolumnlist = lung.columns.to_list()
removecolumnlist.remove("stripped_cell_line_name")
removecolumnlist.append("Cell Line")
lungexpressAll.drop(removecolumnlist, axis=1, inplace=True)
lungexpressAll.set_index("stripped_cell_line_name", inplace=True)

# Transpose so that genes are rows and cell lines are columns.
LE = lungexpressAll.transpose()

# Select the 2000 highly variable genes (HVGs) with the largest variance across cell lines.
Top2000 = LE.var(axis=1).sort_values(ascending=False)[0:2000]
val = LE.loc[Top2000.index]
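
The selection step above can be illustrated on a toy matrix. This is a minimal sketch with made-up gene names, cell line names, and values (not CCLE data): it ranks genes (rows) by their variance across cell lines (columns) and keeps the top `k`.

```python
import pandas as pd

# Toy genes-by-cell-lines matrix; names and values are invented for illustration.
toy = pd.DataFrame(
    {
        "LINE_A": [1.0, 5.0, 0.0, 2.0],
        "LINE_B": [1.1, 0.5, 4.0, 2.0],
        "LINE_C": [0.9, 9.5, 2.0, 2.0],
    },
    index=["GENE_FLAT", "GENE_HIGH", "GENE_MID", "GENE_CONST"],
)

k = 2  # the notebook uses 2000

# Variance of each gene across cell lines (pandas uses the sample variance, ddof=1).
gene_variance = toy.var(axis=1)

# Keep the k genes with the largest variance, most variable first.
top_genes = gene_variance.sort_values(ascending=False).head(k)
top_matrix = toy.loc[top_genes.index]

print(top_matrix)
```

Genes that barely change across cell lines (`GENE_FLAT`, `GENE_CONST`) carry little information for comparing cell lines, which is why they are filtered out before computing correlations.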

Comparison of expression similarity between cell lines by correlation

In [7]:
correlation = val.corr()
correlation
Out[7]:
stripped_cell_line_name NCIH1819 SALE NCIH1618 NCIH889 NCIH1184 SQ1 JL1 NCIH854 NCIH1385 T3M10 ... DV90 NCIH2110 NCIH650 NCIH2342 HCC2450 NCIH1155 NCIH292 LC1SQSF SCLC22H LU135
stripped_cell_line_name
NCIH1819 1.000000 0.485578 0.287696 0.167879 0.187428 0.516792 0.345822 0.358988 0.192794 0.420964 ... 0.410416 0.530498 0.341989 0.692735 0.400669 0.227765 0.507148 0.421654 0.158397 0.140666
SALE 0.485578 1.000000 0.074010 -0.033752 -0.048017 0.693487 0.488306 0.378980 0.027095 0.672050 ... 0.496662 0.586875 0.456705 0.512722 0.424270 0.053416 0.781673 0.464621 0.001774 -0.002607
NCIH1618 0.287696 0.074010 1.000000 0.605183 0.685070 0.111884 0.064536 0.302465 0.525158 0.020139 ... 0.152469 0.144963 0.074072 0.236139 0.068344 0.514212 0.084972 0.225485 0.579617 0.549149
NCIH889 0.167879 -0.033752 0.605183 1.000000 0.587836 0.010820 0.017375 0.212685 0.536791 -0.018484 ... 0.094586 0.022048 0.070789 0.095501 0.066186 0.534221 -0.052797 0.146696 0.615936 0.482970
NCIH1184 0.187428 -0.048017 0.685070 0.587836 1.000000 -0.026200 0.132159 0.134854 0.477524 -0.048107 ... 0.018895 -0.036299 0.133998 0.060013 0.125298 0.590960 -0.051525 0.199748 0.531780 0.620065
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
NCIH1155 0.227765 0.053416 0.514212 0.534221 0.590960 0.086377 0.179788 0.263468 0.490881 0.082765 ... 0.141748 0.071152 0.243054 0.137546 0.220530 1.000000 0.045065 0.226532 0.554269 0.631677
NCIH292 0.507148 0.781673 0.084972 -0.052797 -0.051525 0.660441 0.504858 0.342340 0.025813 0.677222 ... 0.487584 0.645951 0.386423 0.544095 0.413807 0.045065 1.000000 0.453884 -0.015345 -0.004720
LC1SQSF 0.421654 0.464621 0.225485 0.146696 0.199748 0.392014 0.432092 0.269084 0.248377 0.436422 ... 0.349939 0.421808 0.384609 0.411621 0.410761 0.226532 0.453884 1.000000 0.141574 0.239485
SCLC22H 0.158397 0.001774 0.579617 0.615936 0.531780 0.038019 0.047426 0.260327 0.500969 0.010126 ... 0.086771 0.074118 0.030135 0.120127 0.046249 0.554269 -0.015345 0.141574 1.000000 0.601241
LU135 0.140666 -0.002607 0.549149 0.482970 0.620065 0.012657 0.161111 0.176000 0.450627 0.017812 ... 0.041457 0.034500 0.099499 0.094428 0.124505 0.631677 -0.004720 0.239485 0.601241 1.000000

207 rows × 207 columns
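
By default, `DataFrame.corr()` computes the Pearson correlation between every pair of columns, which is why the result above is a square matrix with one row and one column per cell line. The sketch below checks this on invented values by comparing against `numpy.corrcoef`; the column names are placeholders.

```python
import numpy as np
import pandas as pd

# Invented expression values: rows are genes, columns are cell lines.
toy = pd.DataFrame(
    {
        "LINE_A": [0.2, 1.5, 3.1, 4.8, 6.0],
        "LINE_B": [0.1, 1.9, 2.8, 5.2, 5.7],
        "LINE_C": [5.5, 0.3, 4.1, 0.9, 2.2],
    }
)

# Column-by-column Pearson correlation, as in `val.corr()`.
corr = toy.corr()

# numpy.corrcoef treats rows as variables, so pass the transposed values.
expected = np.corrcoef(toy.to_numpy().T)

print(corr.round(3))
```

The matrix is symmetric with ones on the diagonal, because each cell line correlates perfectly with itself.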

Visualize the correlation matrix as a heatmap. The higher the correlation, the more similar the cell lines

In [8]:
plt.figure(figsize=(15, 15))
sns.heatmap(correlation)
Out[8]:
<AxesSubplot:xlabel='stripped_cell_line_name', ylabel='stripped_cell_line_name'>
[Figure: heatmap of pairwise expression correlations among the lung cell lines]

Let's see how the correlation pattern differs between lineage subtypes

1. Non-Small Cell Lung Cancer (NSCLC) lineage subtype

In [9]:
# Select the NSCLC cell lines and merge them with the expression data.
NSCLC = lung[lung["lineage_subtype"] == "NSCLC"]
NSCLCexpressAll = pd.merge(
    NSCLC, expression, left_on="DepMap_ID", right_on="Cell Line", how="inner"
)
removecolumnlist = NSCLC.columns.to_list()
removecolumnlist.remove("stripped_cell_line_name")
removecolumnlist.append("Cell Line")
NSCLCexpressAll.drop(removecolumnlist, axis=1, inplace=True)
NSCLCexpressAll.set_index("stripped_cell_line_name", inplace=True)

NSCLC_expression = NSCLCexpressAll.transpose()

Top2000 = NSCLC_expression.var(axis=1).sort_values(ascending=False)[0:2000]
val = NSCLC_expression.loc[Top2000.index]
correlation = val.corr()
plt.figure(figsize=(15, 15))
sns.heatmap(correlation)
Out[9]:
<AxesSubplot:xlabel='stripped_cell_line_name', ylabel='stripped_cell_line_name'>
[Figure: heatmap of pairwise expression correlations among the NSCLC cell lines]

2. Small Cell Lung Cancer (SCLC) lineage subtype

In [10]:
SCLC = lung[lung["lineage_subtype"] == "SCLC"]
SCLCexpressAll = pd.merge(
    SCLC, expression, left_on="DepMap_ID", right_on="Cell Line", how="inner"
)
removecolumnlist = SCLC.columns.to_list()
removecolumnlist.remove("stripped_cell_line_name")
removecolumnlist.append("Cell Line")
SCLCexpressAll.drop(removecolumnlist, axis=1, inplace=True)
SCLCexpressAll.set_index("stripped_cell_line_name", inplace=True)

SCLC_expression = SCLCexpressAll.transpose()

Top2000 = SCLC_expression.var(axis=1).sort_values(ascending=False)[0:2000]
val = SCLC_expression.loc[Top2000.index]
correlation = val.corr()
plt.figure(figsize=(15, 15))
sns.heatmap(correlation)
Out[10]:
<AxesSubplot:xlabel='stripped_cell_line_name', ylabel='stripped_cell_line_name'>
[Figure: heatmap of pairwise expression correlations among the SCLC cell lines]
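
The two subtype cells above repeat the same steps with a different filter. One way to avoid the duplication is a small helper. The sketch below is an assumption about how this could be factored, not code from the original notebook: the function name `subtype_correlation` and the toy tables are hypothetical, while the column names (`DepMap_ID`, `lineage_subtype`, `stripped_cell_line_name`, `Cell Line`) follow the cells above. It selects the two needed metadata columns before merging instead of dropping the rest afterwards.

```python
import pandas as pd


def subtype_correlation(samples, expression, subtype, n_genes=2000):
    """Correlate cell lines of one lineage subtype over their most variable genes."""
    subset = samples[samples["lineage_subtype"] == subtype]
    merged = pd.merge(
        subset[["DepMap_ID", "stripped_cell_line_name"]],
        expression,
        left_on="DepMap_ID",
        right_on="Cell Line",
        how="inner",
    )
    # Keep only the gene columns, indexed by cell line name, then put genes in rows.
    genes_by_line = (
        merged.drop(columns=["DepMap_ID", "Cell Line"])
        .set_index("stripped_cell_line_name")
        .transpose()
    )
    top = genes_by_line.var(axis=1).sort_values(ascending=False).head(n_genes)
    return genes_by_line.loc[top.index].corr()


# Toy stand-ins for sample_info and expression (all values are invented).
samples = pd.DataFrame(
    {
        "DepMap_ID": ["ID1", "ID2", "ID3"],
        "stripped_cell_line_name": ["LINE_A", "LINE_B", "LINE_C"],
        "lineage_subtype": ["SCLC", "SCLC", "NSCLC"],
    }
)
expression = pd.DataFrame(
    {
        "Cell Line": ["ID1", "ID2", "ID3"],
        "GENE1": [1.0, 2.0, 9.0],
        "GENE2": [4.0, 3.0, 0.5],
        "GENE3": [0.0, 5.0, 2.0],
    }
)

sclc_corr = subtype_correlation(samples, expression, "SCLC", n_genes=3)
print(sclc_corr)
```

With such a helper, each subtype heatmap would reduce to one call followed by `sns.heatmap(...)`, and the variable names could no longer drift out of sync with the subtype being plotted.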

Cell lines with similar gene expression patterns can be grouped together using a clustermap

In [11]:
# clustermap creates its own figure, so no separate plt.figure() call is needed.
g = sns.clustermap(val, figsize=(13, 13))
[Figure: clustermap of the top 2000 variable genes (rows) across the SCLC cell lines (columns)]
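
With its default settings, `sns.clustermap` orders rows and columns by hierarchical clustering, using average linkage on Euclidean distances. The sketch below reproduces only the column ordering on invented data, so the grouping logic is visible without a plot. It assumes SciPy is installed, and the gene and cell line names are placeholders.

```python
import pandas as pd
from scipy.cluster.hierarchy import leaves_list, linkage

# Invented genes-by-cell-lines matrix: LINE_A and LINE_B are similar, LINE_C is not.
toy = pd.DataFrame(
    {
        "LINE_A": [0.1, 5.0, 2.0, 7.5],
        "LINE_C": [9.0, 0.2, 6.5, 0.3],
        "LINE_B": [0.3, 4.8, 2.2, 7.1],
    },
    index=["GENE1", "GENE2", "GENE3", "GENE4"],
)

# Cluster the columns: each cell line is one observation, so transpose first.
column_linkage = linkage(toy.T.to_numpy(), method="average", metric="euclidean")

# Leaf order of the dendrogram, mapped back to cell line names.
column_order = [toy.columns[i] for i in leaves_list(column_linkage)]

print(column_order)
```

Although `LINE_A` and `LINE_B` are not adjacent in the input, they end up next to each other in the leaf order, which is the same effect that places similar cell lines in neighboring clustermap columns.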

In the clustermap above, the SW1271 cell line (second column) and the NCIH2286 cell line (third column) cluster together, whereas both are far from the NCIH2171 cell line (last column). Let's look at the gene expression values for each cell line

In [12]:
twoset = val[["SW1271", "NCIH2286"]]
twoset
Out[12]:
stripped_cell_line_name SW1271 NCIH2286
ASCL1 0.163499 0.678072
RPS4Y1 0.722466 0.378512
GRP 0.176323 0.275007
CALCA 0.214125 1.028569
XAGE1A 10.048623 0.678072
... ... ...
KIRREL2 1.327687 3.941106
TSPAN13 1.855990 2.646163
KDELR3 4.369466 4.486071
VAV3 0.389567 1.835924
DEPP1 0.941106 0.070389

2000 rows × 2 columns

Let's visualize the relationship between SW1271 and NCIH2286 with a scatter plot and a regression line

In [13]:
sns.lmplot(data=twoset, x="SW1271", y="NCIH2286")
Out[13]:
<seaborn.axisgrid.FacetGrid at 0x7f8324765940>
[Figure: scatter plot of SW1271 versus NCIH2286 expression with a fitted regression line]
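
By default, the line drawn by `lmplot` is a linear least-squares fit of `y` on `x`. As a rough sketch on invented values (not the CCLE data), the slope and intercept of such a line can be computed with `numpy.polyfit`:

```python
import numpy as np

# Invented expression values for two cell lines over the same five genes.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.0, 9.0])

# A degree-1 polynomial fit is the least-squares straight line
# y ≈ slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

Applied to the notebook's data, `x` and `y` would be `twoset["SW1271"]` and `twoset["NCIH2286"]`.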

Find the correlation between the expression of two cell lines

In [14]:
twoset["SW1271"].corr(twoset["NCIH2286"])
Out[14]:
0.7532543855591219

The correlation coefficient is 0.75, which indicates a fairly strong positive correlation. Two cell lines that sit close together in the clustermap tend to express the same genes at similar levels.

Now let's compare two distant cell lines in the clustermap

In [15]:
otherset = val[["SW1271", "NCIH2171"]]
sns.lmplot(data=otherset, x="SW1271", y="NCIH2171")
Out[15]:
<seaborn.axisgrid.FacetGrid at 0x7f8344aa5390>
[Figure: scatter plot of SW1271 versus NCIH2171 expression with a fitted regression line]
In [16]:
otherset["SW1271"].corr(otherset["NCIH2171"])
Out[16]:
0.09937315675624152

As expected, the correlation between two cell lines that are far apart in the clustermap is low.
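
Instead of reading neighbors off the clustermap by eye, the correlation matrix can be ranked directly for a reference cell line. The following is a minimal sketch on an invented matrix: the names `REF`, `NEAR`, `MID`, and `FAR` are placeholders and the values are made up.

```python
import pandas as pd

# Invented, symmetric correlation matrix for four placeholder cell lines.
names = ["REF", "NEAR", "MID", "FAR"]
corr = pd.DataFrame(
    [
        [1.00, 0.80, 0.40, 0.05],
        [0.80, 1.00, 0.35, 0.10],
        [0.40, 0.35, 1.00, 0.30],
        [0.05, 0.10, 0.30, 1.00],
    ],
    index=names,
    columns=names,
)

# Correlations of every other cell line with the reference, highest first.
ranked = corr["REF"].drop("REF").sort_values(ascending=False)

most_similar = ranked.index[0]
least_similar = ranked.index[-1]

print(ranked)
```

In the notebook, `correlation["SW1271"]` would play the role of `corr["REF"]`.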

Visualize correlations for multiple cell lines

In [17]:
sets = val[["SW1271", "NCIH2286", "NCIH2171", "NCIH196"]]
g = sns.PairGrid(sets)
g.map(sns.scatterplot)
Out[17]:
<seaborn.axisgrid.PairGrid at 0x7f8343b82828>
[Figure: PairGrid of scatter plots for SW1271, NCIH2286, NCIH2171, and NCIH196]
In [18]:
sets = val[["SW1271", "NCIH2286", "NCIH2171", "NCIH196"]]
g = sns.PairGrid(sets)
g.map_diag(sns.histplot)
g.map_offdiag(sns.scatterplot)
Out[18]:
<seaborn.axisgrid.PairGrid at 0x7f8324e44ba8>
[Figure: PairGrid with histograms on the diagonal and scatter plots off the diagonal]
In [19]:
g = sns.PairGrid(sets, diag_sharey=False, corner=True)
g.map_lower(sns.scatterplot)
g.map_diag(sns.kdeplot)
Out[19]:
<seaborn.axisgrid.PairGrid at 0x7f8324af0f60>
[Figure: lower-triangle PairGrid with scatter plots and kernel density estimates on the diagonal]

Summary

We retrieved the CCLE expression dataset from BRH. We then focused on the top 2000 genes with the largest variance and plotted a clustermap for these highly variable genes (HVGs). From the clustermap, we selected nearby and distant cell lines and visualized their correlation. Finally, we plotted the pairwise relationships among multiple cell lines as scatter plots using PairGrid.